Accessing geospatial data the easy way (Python)¶
Access to geospatial data has changed significantly over the past decade. Traditionally, data was accessed by downloading several files to a local computer and then analyzing them with software or programming languages. Analysis-ready datasets have always been difficult to access due to the diversity of data formats (NetCDF, GRIB2, GeoTIFF, Shapefile, etc.) and the variety of access protocols offered by different providers (OPeNDAP, HTTPS, SFTP, WPS, REST APIs, data marts, etc.). Beyond that, with the ever-increasing size of geospatial datasets, most modern datasets cannot even fit on a local computer, limiting scientific progress.
The datasets presented here are large-scale, analysis-ready, cloud-optimized (ARCO). To provide a single entry point for a list of datasets, we have followed the methodology developed by the Pangeo community, which combines multiple technologies:

- Data lake (S3, Azure Data Lake Storage, GCS, etc.): distributed file-object storage
- Zarr (or alternatively TileDB, COGs): chunked N-dimensional array formats
- Dask (or alternatively Spark, Ray): distributed computing and lazy loading
- Intake catalogs (or alternatively STAC): a general interface for loading different data formats, mostly but not limited to spatiotemporal assets
For more information, please refer to the Pangeo website.
It is important to keep in mind that most of the datasets in the catalogue use language-agnostic formats, making them accessible from a variety of programming languages (Python, Julia, JavaScript, C, etc.) that implement the corresponding specifications (Zarr, NetCDF via kerchunk, GeoJSON, etc.).
[1]:
from distributed import Client
import intake
import hvplot.xarray
import hvplot.pandas
from dask.distributed import PipInstall
import xoak
import xarray as xr
import numpy as np
import pandas as pd
Dask client¶
We use a Dask client so that all subsequent code that is compatible with the framework runs in parallel.
[2]:
client = Client()
client
[2]:
Client: Client-0c257f1c-66a7-11ed-89c7-00224851794f
Connection method: Cluster object | Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
LocalCluster: 2 workers | 2 threads | 6.78 GiB memory | status: running
Intake catalogs¶
Intake is a lightweight package for finding, investigating, loading and disseminating data. A cataloging system organizes a collection of datasets, and data loaders (drivers) are parameterized so that each dataset opens in the format the end user wants. In the Python context, multi-dimensional arrays can be opened with xarray's drivers, while polygons (shapefiles, GeoJSON) can be opened with geopandas.
Here is the URL where you can access the catalog:
[3]:
catalog_url = 'https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/main.yaml'
cat = intake.open_catalog(catalog_url)
cat
main:
args:
path: https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/main.yaml
description: Master Data Catalog
driver: intake.catalog.local.YAMLFileCatalog
metadata: {}
To organize the collection of datasets, the catalogue itself references various sub-catalogs:
[4]:
[cat[field]
for field in list(cat._entries.keys())]
[4]:
[<Intake catalog: hydrology>,
<Intake catalog: atmosphere>,
<Intake catalog: geography>,
<Intake catalog: climate_change>]
Even though our catalogue is constantly expanding, a number of datasets are already available. The next sections contain several example queries and analyses of various datasets.
The current (flattened) catalogue is described in the table below. A dataset should only be used after consulting its status field. A “dev” flag signifies that we are actively working on the dataset and do not recommend using it; a “prod” flag means it is production-ready. The “prod” label signifies that the dataset has undergone quality review and testing; however, users should always double-check on their own because errors are still possible.
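As a quick sketch of how the status field might be consulted in practice (using a small hypothetical frame that mirrors the columns of the table below, not the live catalogue):

```python
import pandas as pd

# Hypothetical subset of the flattened catalogue table, hard-coded here
# for illustration; in practice it is built from the Intake catalog.
datasets = pd.DataFrame(
    [
        ("atmosphere", "era5_reanalysis_single_levels", "prod"),
        ("atmosphere", "ghcnd_world", "dev"),
        ("hydrology", "melcc", "dev"),
    ],
    columns=["field", "dataset_name", "status"],
)

# Keep only production-ready datasets before starting an analysis.
prod_only = datasets[datasets["status"] == "prod"]
print(prod_only["dataset_name"].tolist())  # ['era5_reanalysis_single_levels']
```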
[5]:
pd.set_option('display.max_colwidth', None)
pd.DataFrame([[field ,
dataset,
cat[field][dataset].describe()['description'],
cat[field][dataset].describe()['metadata']['status'][0]]
for field in list(cat._entries.keys())
for dataset in cat[field]._entries.keys()],
columns=['field', 'dataset_name', 'description', 'status']) \
.sort_values('field')
[5]:
| | field | dataset_name | description | status |
|---|---|---|---|---|
| 1 | atmosphere | era5_reanalysis_single_levels | ERA5 hourly estimates of variables on single levels chunked for time series analysis | prod |
| 2 | atmosphere | era5_reanalysis_single_levels_spatial | ERA5 hourly estimates of variables on single levels chunked for spatial analysis | dev |
| 3 | atmosphere | era5_land_reanalysis_spatial | ERA5-Land hourly estimates on single level chunked for spatial analysis | dev |
| 4 | atmosphere | era5_reanalysis_pressure_levels | ERA5 hourly estimates of variables on pressure levels | prod |
| 5 | atmosphere | daymet_daily_na | Daymet Data Version 4.0 | prod |
| 6 | atmosphere | ghcnd_world | Global Historical Climatology Network daily (GHCNd) | dev |
| 7 | atmosphere | scdna | SCDNA a serially complete precipitation and temperature dataset for North America from 1979 to 2018 | prod |
| 8 | atmosphere | 20_century_reanalysis_single_levels | NOAA-CIRES-DOE Twentieth Century Reanalysis (20CR) on single levels spanning 1836 to 2015 chunked for time series analysis | prod |
| 9 | atmosphere | 20_century_reanalysis_single_levels_large_area | NOAA-CIRES-DOE Twentieth Century Reanalysis (20CR) on single levels spanning 1836 to 2015 chunked for spatial analysis | prod |
| 10 | atmosphere | 20_century_reanalysis_pressure_levels | NOAA-CIRES-DOE Twentieth Century Reanalysis (20CR) on pressure levels spanning 1836 to 2015 chunked for time series analysis | prod |
| 11 | atmosphere | 20_century_reanalysis_pressure_levels_large_area | NOAA-CIRES-DOE Twentieth Century Reanalysis (20CR) on pressure levels spanning 1836 to 2015 chunked for spatial analysis | prod |
| 12 | atmosphere | terraclimate | TerraClimate is a dataset of monthly climate and climatic water balance for global terrestrial surfaces from 1958-2019 | prod |
| 14 | climate_change | rcp45_day_NAM_22i_raw_zarr | NA-Cordex (limited to rcp45 for now... more to come!) | dev |
| 13 | geography | melcc_polygons | MELCC basin delimitation | dev |
| 0 | hydrology | melcc | CEHQ daily flow and water levels | dev |
1) Atmosphere datasets¶
a) ERA5 single levels¶
ERA5 is the fifth generation ECMWF atmospheric reanalysis of the global climate covering the period from January 1950 to present. ERA5 is produced by the Copernicus Climate Change Service (C3S) at ECMWF.
Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.
| Property | Values |
|---|---|
| Temporal extent | 01/01/1979 – 12/31/2020 |
| Spatial extent | World: [-180, 180, -90, 90] |
| Chunks (time-series version) | {'time': 14880, 'longitude': 15, 'latitude': 15} |
| Chunks (spatial version) | {'time': 24, 'longitude': 1440, 'latitude': 721} |
| Spatial resolution | 0.25 degrees |
| Spatial reference | WGS84 (EPSG:4326) |
| Temporal resolution | 1 hour |
| Update frequency | In 2023, we will update it weekly |
Data access¶
[6]:
ds = cat.atmosphere.era5_reanalysis_single_levels.to_dask()
ds
[6]:
<xarray.Dataset>
Dimensions: (latitude: 721, longitude: 1440, time: 368184)
Coordinates:
* latitude (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
* longitude (longitude) float32 -180.0 -179.8 -179.5 ... 179.2 179.5 179.8
* time (time) datetime64[ns] 1979-01-01 ... 2020-12-31T23:00:00
Data variables:
t2m (time, latitude, longitude) float32 dask.array<chunksize=(14880, 15, 15), meta=np.ndarray>
tp (time, latitude, longitude) float32 dask.array<chunksize=(14880, 15, 15), meta=np.ndarray>
Attributes:
institution: ECMWF
source: Reanalysis
title: ERA5 forecasts
Working with the data¶
We can quickly select data subsets in both space and time using xarray. Here, we choose July 19–20, 1996, a period when Quebec (Canada) saw historically extreme precipitation. The plotting package hvplot can then be used to track the storm throughout the event.
[7]:
%%time
da = ds.tp \
.sel(time=slice('1996-07-19','1996-07-20'),
longitude=slice(-90,-50),
latitude=slice(60,35))
da \
.where(da>=0.001) \
.load() \
.hvplot(groupby='time',
widget_type='scrubber',
widget_location='bottom',
cmap='gist_ncar',
tiles='ESRI',
geo=True,
clim=(0.001, 0.005),
width=750,
height=400)
CPU times: user 4.52 s, sys: 323 ms, total: 4.84 s
Wall time: 10.1 s
[7]:
Because this Zarr version of ERA5 is optimised for time series analysis, the full historical record can be quickly extracted over a relatively small spatial extent (a point or a polygon, for instance). By contrast, working with a collection of NetCDF files is typically extremely compute-intensive for large datasets, because the NetCDF files are chunked along the time dimension.
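The effect of the chunk layout can be sketched with a toy dask-backed array (scaled down from the real {'time': 14880, 'longitude': 15, 'latitude': 15} versus {'time': 24, 'longitude': 1440, 'latitude': 721} layouts): extracting the full history at one grid point touches a single chunk in the time-series layout, but every time chunk in the spatial layout.

```python
import dask.array as da
import xarray as xr

# Toy analogue of the two chunking schemes on a 1000 x 40 x 40 grid.
time, lat, lon = 1000, 40, 40

timeseries = xr.DataArray(
    da.zeros((time, lat, lon), chunks=(1000, 5, 5)),
    dims=("time", "latitude", "longitude"),
)
spatial = xr.DataArray(
    da.zeros((time, lat, lon), chunks=(10, 40, 40)),
    dims=("time", "latitude", "longitude"),
)

# Number of chunks a point time series has to read under each layout.
print(timeseries.isel(latitude=0, longitude=0).data.npartitions)  # 1
print(spatial.isel(latitude=0, longitude=0).data.npartitions)     # 100
```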
[8]:
%%time
da = (1000*ds.tp) \
.sel(longitude=-75,
latitude=45,
method='nearest')
da.hvplot(grid=True, width=800, height=500, color='blue')
CPU times: user 190 ms, sys: 47.3 ms, total: 238 ms
Wall time: 1.61 s
[8]:
[9]:
%%time
da = (1000*ds.tp) \
.sel(longitude=-75,
latitude=45,
method='nearest') \
.resample(time='1Y') \
.sum()
da.hvplot.line(grid=True, width=800, height=500, color='blue')* \
da.hvplot.scatter(marker='o').opts(color='black', size=14)
CPU times: user 934 ms, sys: 20.5 ms, total: 954 ms
Wall time: 3.34 s
[9]:
b) ERA5 pressure levels¶
ERA5 is the fifth generation ECMWF atmospheric reanalysis of the global climate covering the period from January 1950 to present. ERA5 is produced by the Copernicus Climate Change Service (C3S) at ECMWF.
| Property | Values |
|---|---|
| Temporal extent | 01/01/1979 – 12/31/2019 |
| Spatial extent | Atlantic Northeast: [-96, -52, 40, 63] |
| Chunks | {'time': 8760, 'longitude': 25, 'latitude': 25, 'level': 1} |
| Spatial resolution | 0.25 degrees |
| Spatial reference | WGS84 (EPSG:4326) |
| Temporal resolution | 1 hour |
| Update frequency | None |
[10]:
ds = cat.atmosphere.era5_reanalysis_pressure_levels.to_dask()
ds
[10]:
<xarray.Dataset>
Dimensions: (latitude: 93, level: 6, longitude: 177, time: 359400)
Coordinates:
* latitude (latitude) float32 63.0 62.75 62.5 62.25 ... 40.5 40.25 40.0
* level (level) int32 300 400 500 700 850 1000
* longitude (longitude) float32 -96.0 -95.75 -95.5 ... -52.5 -52.25 -52.0
* time (time) datetime64[ns] 1979-01-01 ... 2019-12-31T23:00:00
Data variables:
r (time, level, latitude, longitude) float32 dask.array<chunksize=(8760, 1, 25, 25), meta=np.ndarray>
t (time, level, latitude, longitude) float32 dask.array<chunksize=(8760, 1, 25, 25), meta=np.ndarray>
u (time, level, latitude, longitude) float32 dask.array<chunksize=(8760, 1, 25, 25), meta=np.ndarray>
v (time, level, latitude, longitude) float32 dask.array<chunksize=(8760, 1, 25, 25), meta=np.ndarray>
z (time, level, latitude, longitude) float32 dask.array<chunksize=(8760, 1, 25, 25), meta=np.ndarray>
Attributes:
Conventions: CF-1.6
history: 2019-12-18 03:49:32 GMT by grib_to_netcdf-2.14.0: /opt/ecmw...
Working with the data¶
[11]:
%%time
ds.z \
.sel(longitude=-75, latitude=45, level=[500, 700, 850, 1000]).hvplot(grid=True, by='level')
CPU times: user 1.31 s, sys: 140 ms, total: 1.45 s
Wall time: 15.8 s
[11]:
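A common post-processing step with this dataset's u and v wind components is deriving the horizontal wind speed. A minimal sketch, with small synthetic arrays standing in for the catalogue's variables:

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for the u/v wind components on one pressure level;
# in practice these come from the era5_reanalysis_pressure_levels dataset.
rng = np.random.default_rng(0)
dims = ("time", "latitude", "longitude")
u = xr.DataArray(rng.normal(size=(24, 5, 5)), dims=dims)
v = xr.DataArray(rng.normal(size=(24, 5, 5)), dims=dims)

# Horizontal wind speed sqrt(u^2 + v^2), applied element-wise by xarray
# (and lazily, chunk by chunk, when the inputs are dask-backed).
wind_speed = np.hypot(u, v)
assert wind_speed.dims == dims
```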